docs(longhaul): add long-haul test design document #400
Pull request overview
This PR adds (or refreshes) the long-haul (canary) test driver for the DocumentDB Kubernetes Operator: a standalone Go module under test/longhaul/ with writers/verifiers, disruption-window journaling, a weighted-random operations scheduler (scale + DocumentDB upgrade), health/leak monitoring, periodic reporting to a longhaul-report ConfigMap, plus in-cluster Deployment packaging and GitHub Actions workflows (build/deploy/monitor). It also updates the long-haul design document and supporting README/manifests.
Changes:
- Introduces the long-haul test driver Go module (`test/longhaul/`) with workload, monitor, operations, journal, and reporting components.
- Adds Kubernetes deployment artifacts (Deployment + RBAC + setup manifest) and CI workflows to build, deploy, and monitor the long-haul canary.
- Updates `docs/designs/long-haul-test-design.md` to describe the architecture, invariants, and operations catalog.
Reviewed changes
Copilot reviewed 37 out of 38 changed files in this pull request and generated 24 comments.
| File | Description |
|---|---|
| test/longhaul/workload/writer.go | Writer loop that inserts majority-acknowledged documents with checksum/seq tracking. |
| test/longhaul/workload/verifier.go | Periodic verifier scanning for gaps and checksum mismatches under majority read concern. |
| test/longhaul/workload/metrics.go | Atomic counters + snapshot helpers for workload metrics. |
| test/longhaul/report/suite_test.go | Ginkgo suite bootstrap for report package tests. |
| test/longhaul/report/report.go | Markdown report generator for long-haul run state. |
| test/longhaul/report/checkpoint.go | Periodic reporter writing stdout + longhaul-report ConfigMap. |
| test/longhaul/report/checkpoint_test.go | Unit tests for ConfigMap create/update/result-field behavior. |
| test/longhaul/report/alert.go | GitHub Actions annotations for pass/fail/leak warnings. |
| test/longhaul/README.md | Driver usage, deployment instructions, and config reference. |
| test/longhaul/operations/upgrade.go | DocumentDB version upgrade operation using desired-version ConfigMap + steady-state gate. |
| test/longhaul/operations/suite_test.go | Ginkgo suite bootstrap for operations tests. |
| test/longhaul/operations/scheduler.go | Weighted-random scheduler with cooldown + steady-state gating + disruption windows. |
| test/longhaul/operations/scheduler_test.go | Unit tests for weighted selection + cooldown short-circuiting. |
| test/longhaul/operations/scale.go | Scale up/down operations with patch-confirmation polling and outage policies. |
| test/longhaul/monitor/suite_test.go | Ginkgo suite bootstrap for monitor tests. |
| test/longhaul/monitor/leakdetect.go | Linear-regression leak detector over sampled memory/CPU. |
| test/longhaul/monitor/k8sclient.go | Real Kubernetes client implementation (pods/CR/metrics, CR patching). |
| test/longhaul/monitor/health.go | Health monitor with steady-state tracking and recovery waits. |
| test/longhaul/monitor/health_test.go | Unit tests for steady-state and wait semantics using a fake ClusterClient. |
| test/longhaul/journal/suite_test.go | Ginkgo suite bootstrap for journal tests. |
| test/longhaul/journal/policy.go | Outage policy + disruption window evaluation logic. |
| test/longhaul/journal/policy_test.go | Unit tests pinning boundary behavior of the verdict oracle. |
| test/longhaul/journal/journal.go | Thread-safe append-only journal + disruption-window tracking. |
| test/longhaul/journal/journal_test.go | Unit tests for journal behavior and concurrency safety. |
| test/longhaul/go.sum | Dependency lockfile for the long-haul module. |
| test/longhaul/go.mod | New standalone Go module for the long-haul driver. |
| test/longhaul/Dockerfile | Multi-stage container build for the long-haul binary. |
| test/longhaul/deploy/setup.yaml | Namespace + DocumentDB CR bootstrap manifest for the canary cluster. |
| test/longhaul/deploy/rbac.yaml | ServiceAccount + Role/Bindings + metrics ClusterRole for the driver. |
| test/longhaul/deploy/deployment.yaml | ConfigMap-driven Deployment manifest for in-cluster execution. |
| test/longhaul/config/suite_test.go | Ginkgo suite bootstrap for config tests. |
| test/longhaul/config/config.go | Env-driven config loading + validation for the driver. |
| test/longhaul/config/config_test.go | Unit tests for env parsing, validation, and enable flag parsing. |
| test/longhaul/cmd/longhaul/main.go | Standalone binary wiring: Mongo workload, ops scheduler, monitoring, reporting. |
| docs/designs/long-haul-test-design.md | Updated design doc describing architecture, invariants, and phases. |
| .github/workflows/longhaul-monitor.yaml | Hourly monitor workflow for Deployment health/report staleness + version publishing. |
| .github/workflows/longhaul-image-build.yml | Workflow to build/push the long-haul driver image to GHCR. |
| .github/workflows/longhaul-deploy.yml | Workflow to roll the driver Deployment on AKS (manual + workflow_run). |
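The summary describes test/longhaul/monitor/leakdetect.go as a linear-regression leak detector over sampled memory/CPU. As a rough illustration of that idea (not the file's actual code), the sketch below fits a least-squares line through memory samples and returns the slope. The `Sample` type, the `SlopePerHour` name, and the choice of hours as the time unit are assumptions made for this example only.

```go
package main

import "fmt"

// Sample is a hypothetical reading: elapsed hours since the run started
// and resident memory in bytes. The real driver's types may differ.
type Sample struct {
	Hours float64
	Bytes float64
}

// SlopePerHour fits Bytes = a + b*Hours by ordinary least squares and
// returns b. ok is false when the samples cannot determine a slope
// (fewer than two samples, or all samples share one timestamp).
func SlopePerHour(samples []Sample) (slope float64, ok bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumY, sumTT, sumTY float64
	for _, s := range samples {
		sumT += s.Hours
		sumY += s.Bytes
		sumTT += s.Hours * s.Hours
		sumTY += s.Hours * s.Bytes
	}
	den := n*sumTT - sumT*sumT
	if den == 0 {
		return 0, false
	}
	return (n*sumTY - sumT*sumY) / den, true
}

func main() {
	// Synthetic data: a 100 MB baseline growing by 2 MB every hour.
	var samples []Sample
	for h := 0; h < 24; h++ {
		samples = append(samples, Sample{Hours: float64(h), Bytes: 100e6 + 2e6*float64(h)})
	}
	slope, ok := SlopePerHour(samples)
	fmt.Printf("ok=%v slope=%.0f bytes/hour\n", ok, slope)
}
```

A detector built on this would compare the slope against a threshold over a sufficiently long window, so that short-lived spikes do not read as leaks; the threshold and window length are tuning choices this sketch does not cover.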
🤖 Auto-triaged by documentdb-triage-tool. Applied:
Reasoning: component from path globs (test, ci, docs, dependencies); effort from diff stats (5023+0 LOC, 38 files); LLM: Single-file docs-only update to a design document with no build or test impact, part of a larger split PR series. If a label is wrong, remove it manually and ping
15eb6f4 to ff2c1cb
031d0a5 to f848d41
f848d41 to fc797a3
Adds the design doc covering goals, architecture (writer/verifier loop, operations scheduler, monitor, journal, report), data plane invariants (majority writes, gap detection, checksum validation), failure modes, and relationship to test/e2e. Split from documentdb#348 as a standalone reviewable PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
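The data plane invariants named in this commit message (gap detection, checksum validation) can be illustrated with a small sketch. It is not the driver's verifier: it assumes each writer emits sequence numbers starting at 1 with no intentional holes, and it uses SHA-256 over the payload as the checksum, neither of which is stated in this PR. The `Doc`, `Checksum`, and `Verify` names are hypothetical, and the sketch works on an in-memory slice rather than a majority-read query.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Doc is a hypothetical stored document: a per-writer sequence number,
// a payload, and the checksum recorded at write time.
type Doc struct {
	Seq      int64
	Payload  string
	Checksum string
}

// Checksum returns the hex SHA-256 of the payload (an assumed algorithm).
func Checksum(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// Verify checks the documents read back against the highest acknowledged
// sequence number. It returns the missing sequence numbers in [1, maxAcked]
// and the sequence numbers whose stored checksum no longer matches.
func Verify(docs []Doc, maxAcked int64) (gaps, corrupt []int64) {
	seen := make(map[int64]bool, len(docs))
	for _, d := range docs {
		seen[d.Seq] = true
		if Checksum(d.Payload) != d.Checksum {
			corrupt = append(corrupt, d.Seq)
		}
	}
	for seq := int64(1); seq <= maxAcked; seq++ {
		if !seen[seq] {
			gaps = append(gaps, seq)
		}
	}
	sort.Slice(corrupt, func(i, j int) bool { return corrupt[i] < corrupt[j] })
	return gaps, corrupt
}

func main() {
	docs := []Doc{
		{Seq: 1, Payload: "a", Checksum: Checksum("a")},
		{Seq: 2, Payload: "b", Checksum: Checksum("b")},
		{Seq: 4, Payload: "d", Checksum: "not-the-right-checksum"},
	}
	gaps, corrupt := Verify(docs, 4)
	fmt.Println("gaps:", gaps, "corrupt:", corrupt)
}
```

Only acknowledged writes count toward `maxAcked`, so a write that failed during a disruption window is not reported as a gap; an acknowledged write that later goes missing is.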
fc797a3 to bdc8cf0
…eate) Address @hossain-rayhan feedback on documentdb#400: the doc didn't make explicit that a Fatal failure preserves the cluster for post-mortem rather than auto-recreating it, and that recovery is manually triggered after a maintainer reviews the alert from the monitor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
| **Journal** | In-process append-only event log shared by all components. | Reproducible event stream for the report. |
| **Report** | Aggregates the journal into a markdown summary at a configurable interval; raises alerts on threshold breaches. | Markdown report; alert lines. |

### Cluster Topology
we should mention that where possible we want to reuse the code from the e2e tests (e.g. client)
Added a "Code reuse" paragraph at the end of the Architecture section in ab9a009 (will update SHA in reply if push lands differently):
Where possible, the driver consumes the same helpers as the e2e suite; the Mongo client, DocumentDB lifecycle operations (create / patch / wait-healthy / delete), and TLS plumbing all live in a shared `test/shared` Go module. This keeps long-haul behavior aligned with what e2e exercises and avoids two diverging mongo-driver wrappers.
That test/shared module is what #401 extracts — long-haul will consume it from day one.
## Lifecycle

The test runs **continuously**: no cycles, no resets. Workload, metrics, operations, and health monitoring all run as long-lived processes. The system accumulates real state (PVC growth, CR history, operator memory) exactly as it would in production.
Please elaborate on how we deal with different versions, e.g. are we always running the latest, current, etc.? When are we updating? Is that part of the test?
Is there a point when we start over? What are the criteria?
**Workload runs through upgrades.** No drain, no quiesce. Draining before upgrade hides exactly the upgrade-under-state bugs we're testing.
Are we also downgrading? Or are we starting over at some point so we can test upgrade more than once?
| **Lifecycle** | DocumentDB version upgrade, operator upgrade |
| **HA** | controlled failover |
| **Chaos** | kill primary pod, drain node |
| **Data protection** | trigger backup, verify backup |
Do we have operator upgrades as well?
Operator chaos?
Remote nodes/multi-region? (maybe not now but potentially planned in the future)
- One disruptive op at a time. Overlapping disruptions are non-diagnosable.
- Per-category cooldown between ops. Lets the cluster stabilize.
- Steady-state gate: health check must pass before the next op fires.
- Backup isolation: no topology changes during backup.
why not? Backup should block/delay - this should be handled by backup
Agreed — dropped the bullet (d1694f5). Backup-vs-topology is the backup feature's job; isolating it in the harness would hide exactly the bugs we want to catch.
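For reference, the three scheduling rules that remain after dropping the backup bullet (one disruptive op at a time, per-category cooldown, steady-state gate) could be combined with weighted selection roughly as follows. This is an illustrative sketch only, not the code in test/longhaul/operations/scheduler.go: the `Op` and `Scheduler` types, the `CooldownSec` field, and the op names and weights are all hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Op is a hypothetical disruptive operation; names and weights are illustrative.
type Op struct {
	Name     string
	Category string
	Weight   int
}

// Scheduler applies the three gating rules before a weighted-random pick.
type Scheduler struct {
	Ops         []Op
	CooldownSec int64
	lastRun     map[string]int64 // category -> completion time in seconds
	inFlight    bool
}

// Next returns the op to run at time now, or ok=false when a gate blocks it.
// pick(n) must return an integer in [0, n); rand.Intn satisfies that contract.
func (s *Scheduler) Next(now int64, healthy bool, pick func(n int) int) (Op, bool) {
	// Rule 1 (one op at a time) and rule 3 (steady-state gate).
	if s.inFlight || !healthy {
		return Op{}, false
	}
	// Rule 2: skip every op whose category is still cooling down.
	var eligible []Op
	total := 0
	for _, op := range s.Ops {
		last, ran := s.lastRun[op.Category]
		if ran && now-last < s.CooldownSec {
			continue
		}
		if op.Weight > 0 {
			eligible = append(eligible, op)
			total += op.Weight
		}
	}
	if total == 0 {
		return Op{}, false
	}
	// Weighted selection: walk the cumulative weights until r falls in a bucket.
	r := pick(total)
	for _, op := range eligible {
		if r < op.Weight {
			s.inFlight = true
			return op, true
		}
		r -= op.Weight
	}
	return Op{}, false
}

// Done records completion so the category cooldown starts and the next op may fire.
func (s *Scheduler) Done(op Op, now int64) {
	if s.lastRun == nil {
		s.lastRun = make(map[string]int64)
	}
	s.lastRun[op.Category] = now
	s.inFlight = false
}

func main() {
	s := &Scheduler{
		Ops: []Op{
			{Name: "scale-up", Category: "scale", Weight: 3},
			{Name: "upgrade", Category: "lifecycle", Weight: 1},
		},
		CooldownSec: 600,
	}
	if op, ok := s.Next(0, true, rand.Intn); ok {
		fmt.Println("running", op.Name)
		s.Done(op, 60)
	}
	_, ok := s.Next(120, false, rand.Intn)
	fmt.Println("fires while unhealthy:", ok)
}
```

Passing the random source in as a function keeps the selection deterministic under test. Note that nothing here special-cases backup, which matches the decision above to leave backup-vs-topology serialization to the backup feature.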
**Per-component attribution.** Metrics are tagged by component (operator pod RSS, DB pod RSS, goroutine count, reconcile rate, API-call rate). Without separate series, a memory climb at hour 30 is undiagnosable.

**Human-in-the-loop alerts.** The hourly monitor posts a summary to the workflow run and, when configured, to a chat channel. A maintainer reviews the evidence and manually creates a GitHub issue. No auto-filed issues: alert fatigue from transient or infrastructure failures would erode trust in the canary.
We should also record the system dashboard metrics (latency, uptime, etc.) as well as logs of all components for later analysis. Where do we keep them?
Added an Artifact Retention subsection in 314f3dc. Two tiers:
- Rolling status: `longhaul-report` ConfigMap polled by the monitor workflow (this part is already used in the existing driver).
- Forensics bundle: pod logs, events, CR snapshots, metric samples, and the journal, uploaded as a GitHub Actions artifact on every Tier-1 / Tier-2 alert and at end of run.
Operational details (which collectors, sanitization rules, bundle layout) are kept in test/longhaul/README.md rather than the design doc, so the design stays high-level.
| **CloudNative-PG** | Failover via pod delete + SIGSTOP; pod-level resource sampling | Ginkgo framework (we use a long-lived `Deployment` instead) |
| **CockroachDB** | Chaos runner; separate workload from disruption; roachstress | Custom roachtest framework (too heavy) |
| **Vitess** | Background stress goroutine; per-query tracking | No fault injection (we need disruptive ops) |
We are also interested in how FoundationDB tests (they turned their approach into Antithesis); not sure if they cover long-haul, though.
Added FoundationDB and Antithesis as separate rows in the Learnings table (309aefc). Short answer to your question: neither covers long-haul — both run in simulated time on fake network/disk, so they catch rare-interleaving logic bugs in seconds but can't surface the wall-clock accumulation bugs (mem leaks, lock-table bloat, CR-history drift) that need real reconciliation cycles over real days. We adopt their property-based oracle and workload/fault separation.
## Open Questions

1. Multi-region canary scope: AKS Fleet integration?
Agreed — renamed Open Questions -> Future Scope and reworded the multi-region item so it reads as explicitly deferred (still a candidate before GA if scope allows). See 8a79edd.
Per @xgerman feedback, call out that both Primary and Baseline run with production-style podAntiAffinity and a PodDisruptionBudget so chaos and upgrade operations exercise operator/DB bugs rather than misconfiguration failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman feedback, state explicitly that the driver reuses e2e helpers (Mongo client, DocumentDB lifecycle ops, TLS plumbing) from the shared test/shared Go module rather than forking them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, multi-region (AKS Fleet) is deferred. Rename the Open Questions section to Future Scope and reword the item so the deferred status is explicit; no open design questions remain. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, call out FDB Simulation and Antithesis. Both are deterministic simulation tools that target rare-interleaving logic bugs in simulated time; they don't cover the wall-clock accumulation bugs that long-haul exists for. We adopt their property-based oracle and workload/fault separation, not the simulation engine itself. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, spell out where evidence is kept. Two tiers: a rolling status summary in the longhaul-report ConfigMap (already used by the monitor workflow), and a forensics bundle uploaded as a GitHub Actions artifact on alert and at end of run. Operational details (collectors, sanitization, layout) belong in test/longhaul/README.md, not the design. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, isolating backup from topology hides exactly the serialization bugs long-haul should catch. Backup-vs-topology is the backup feature's responsibility, not the harness's. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Part 1/5 of #348 split.
Scope
Adds `docs/designs/long-haul-test-design.md` (367 lines, new file).
Content
`spec.instancesPerNode` scaling
`test/e2e`
Verification
Docs-only; no build/test impact.
Related
Splits #348 into 5 focused PRs:
`test/shared` module extraction + e2e migration (can land in parallel)